from Clean_data import countries_subset
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
import pygal
from IPython.display import SVG, display
%matplotlib notebook
dateRep year_week cases_weekly deaths_weekly countriesAndTerritories \
0 04/01/2021 2020-53 902 60 Afghanistan
1 28/12/2020 2020-52 1994 88 Afghanistan
2 21/12/2020 2020-51 740 111 Afghanistan
3 14/12/2020 2020-50 1757 71 Afghanistan
4 07/12/2020 2020-49 1672 137 Afghanistan
geoId countryterritoryCode popData2019 continentExp \
0 AF AFG 38041757.0 Asia
1 AF AFG 38041757.0 Asia
2 AF AFG 38041757.0 Asia
3 AF AFG 38041757.0 Asia
4 AF AFG 38041757.0 Asia
notification_rate_per_100000_population_14-days
0 7.61
1 7.19
2 6.56
3 9.01
4 7.22
country country_code year_week new_cases tests_done population \
0 Austria AT 2020-W15 1954 12339 8858775
1 Austria AT 2020-W16 773 58488 8858775
2 Austria AT 2020-W17 479 33443 8858775
3 Austria AT 2020-W18 349 26598 8858775
4 Austria AT 2020-W19 249 42153 8858775
testing_rate positivity_rate testing_data_source
0 139.285624 15.835967 Manual webscraping
1 660.226724 1.321639 Manual webscraping
2 377.512692 1.432288 Manual webscraping
3 300.244673 1.312129 Country website
4 475.833284 0.590705 Country website
country indicator date year_week value \
0 Austria Daily hospital occupancy 2020-04-01 2020-W14 856.0
1 Austria Daily hospital occupancy 2020-04-02 2020-W14 823.0
2 Austria Daily hospital occupancy 2020-04-03 2020-W14 829.0
3 Austria Daily hospital occupancy 2020-04-04 2020-W14 826.0
4 Austria Daily hospital occupancy 2020-04-05 2020-W14 712.0
source url
0 Country_Website https://covid19-dashboard.ages.at/dashboard_Ho...
1 Country_Website https://covid19-dashboard.ages.at/dashboard_Ho...
2 Country_Website https://covid19-dashboard.ages.at/dashboard_Ho...
3 Country_Website https://covid19-dashboard.ages.at/dashboard_Ho...
4 Country_Website https://covid19-dashboard.ages.at/dashboard_Ho...
country indicator year_week value
1144 Belgium Weekly new hospital admissions per 100k 2020-W06 NaN
1145 Belgium Weekly new hospital admissions per 100k 2020-W07 NaN
1146 Belgium Weekly new hospital admissions per 100k 2020-W08 NaN
1147 Belgium Weekly new hospital admissions per 100k 2020-W09 NaN
1148 Belgium Weekly new hospital admissions per 100k 2020-W10 NaN
country year_week Weekly new ICU admissions per 100k \
0 Belgium 2020-11 NaN
1 Belgium 2020-12 NaN
2 Belgium 2020-13 NaN
3 Belgium 2020-14 NaN
4 Belgium 2020-15 NaN
Weekly new hospital admissions per 100k country_code new_cases \
0 13.303631 BE 1735
1 13.303631 BE 4614
2 30.474394 BE 8540
3 30.937053 BE 9890
4 24.520932 BE 9778
tests_done population testing_rate positivity_rate dateRep \
0 9924 11455519 86.630732 17.482870 16/03/2020
1 17066 11455519 148.976227 27.036212 23/03/2020
2 27141 11455519 236.925101 31.465311 30/03/2020
3 38426 11455519 335.436570 25.737782 06/04/2020
4 53517 11455519 467.172199 18.270830 13/04/2020
cases_weekly deaths_weekly
0 1735 31
1 4614 265
2 8540 874
3 9890 1671
4 9778 1999
['BE', 'CZ', 'DE', 'DK', 'EE', 'EL', 'FR', 'IE', 'IT', 'LV', 'MT', 'NL', 'PT', 'SE', 'SI']
year_week
2020-06 1
2020-07 1
2020-08 2
2020-09 4
2020-10 9
2020-11 14
2020-12 14
2020-13 15
2020-14 15
2020-15 15
2020-16 15
2020-17 15
2020-18 15
2020-19 15
2020-20 15
2020-21 15
2020-22 15
2020-23 15
2020-24 15
2020-25 15
2020-26 15
2020-27 15
2020-28 15
2020-29 15
2020-30 15
2020-31 15
2020-32 15
2020-33 15
2020-34 15
2020-35 15
2020-36 15
2020-37 15
2020-38 15
2020-39 15
2020-40 15
2020-41 15
2020-42 15
2020-43 14
2020-44 15
2020-45 15
2020-46 15
2020-47 15
2020-48 15
2020-49 15
2020-50 15
2020-51 15
2020-52 14
2020-53 14
dtype: int64
(649, 12)
countries_subset
| country | year_week | country_code | tests_done | population | testing_rate | positivity_rate | dateRep | cases_weekly | deaths_weekly | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | 2020-11 | BE | 9924 | 11455519 | 86.630732 | 17.482870 | 16/03/2020 | 1735 | 31 |
| 1 | Belgium | 2020-12 | BE | 17066 | 11455519 | 148.976227 | 27.036212 | 23/03/2020 | 4614 | 265 |
| 2 | Belgium | 2020-13 | BE | 27141 | 11455519 | 236.925101 | 31.465311 | 30/03/2020 | 8540 | 874 |
| 3 | Belgium | 2020-14 | BE | 38426 | 11455519 | 335.436570 | 25.737782 | 06/04/2020 | 9890 | 1671 |
| 4 | Belgium | 2020-15 | BE | 53517 | 11455519 | 467.172199 | 18.270830 | 13/04/2020 | 9778 | 1999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 860 | Sweden | 2020-49 | SE | 261230 | 10230185 | 2553.521759 | 14.218122 | 07/12/2020 | 37142 | 446 |
| 861 | Sweden | 2020-50 | SE | 270944 | 10230185 | 2648.476054 | 16.016225 | 14/12/2020 | 43395 | 526 |
| 862 | Sweden | 2020-51 | SE | 299447 | 10230185 | 2927.092716 | 15.971107 | 21/12/2020 | 47825 | 379 |
| 863 | Sweden | 2020-52 | SE | 232114 | 10230185 | 2268.913026 | 16.840432 | 28/12/2020 | 39089 | 237 |
| 864 | Sweden | 2020-53 | SE | 232114 | 10230185 | 2268.913026 | 17.762823 | 04/01/2021 | 41230 | 103 |
649 rows × 10 columns
Based on the structure of the final dataset there is a set of questions which arise here.
Testing rate is a great way to compare countries by COVID testings, as it is calculated as number of tests per 100K of population. So, we can compare countries with different population level. Timeline below let us compare testing rate per countries at the different time slices.
plt.figure(figsize=(15,8))
plt.xticks(rotation=90)
ax = sns.lineplot( data=countries_subset, x="year_week", y="testing_rate", hue="country")#testing rate plot
ax.set(xlabel="Week of the year", ylabel = "Testing rate", title = 'Testing rate per 100 000 population')
[Text(0.5, 0, 'Week of the year'), Text(0, 0.5, 'Testing rate'), Text(0.5, 1.0, 'Testing rate per 100 000 population')]
Plot describes that during first few weeks Malta had the highest testing rate across all countries, probably because it has the smallest population rate. But most of the observed period Denmark was a leader by testing rate. This number increased obviously and the highest testing rate (more than 14000) observed at the 51th week of 2020 year.
That is rather important question which let us decide if it is important to increase testing rate to define more positive COVID cases. To answer this question some manipulations with dataset have been provided. Columns used at this answer have been separated to another data frame. This new dataframe have been reshaped from long to the wide format. Also format of the date have been changed to weeks only, as all data cover the same 2020 year. That makes text at the x-axis more clear. First few row of the new dataset are described below.
we need to standerds the value of test rate and the postive
testing_positivity_subset = countries_subset[['country', 'year_week', 'testing_rate', 'positivity_rate']]#select columns
testing_positivity_subset = testing_positivity_subset.reset_index()#reset indices
testing_positivity_subset.head()
| index | country | year_week | testing_rate | positivity_rate | |
|---|---|---|---|---|---|
| 0 | 0 | Belgium | 2020-11 | 86.630732 | 17.482870 |
| 1 | 1 | Belgium | 2020-12 | 148.976227 | 27.036212 |
| 2 | 2 | Belgium | 2020-13 | 236.925101 | 31.465311 |
| 3 | 3 | Belgium | 2020-14 | 335.436570 | 25.737782 |
| 4 | 4 | Belgium | 2020-15 | 467.172199 | 18.270830 |
to make the data stander have the range from 0:1 we create stander function:
inputs :
ouputs:
def stander(data_frame, col1):
min_=data_frame[col1].min()
max_=data_frame[col1].max()
max_min_diff=max_-min_
data_frame[col1]=(data_frame[col1]-min_)/max_min_diff
return data_frame.head()
stander(testing_positivity_subset,'testing_rate') #create testing_rate as stander
| index | country | year_week | testing_rate | positivity_rate | |
|---|---|---|---|---|---|
| 0 | 0 | Belgium | 2020-11 | 0.005532 | 17.482870 |
| 1 | 1 | Belgium | 2020-12 | 0.009736 | 27.036212 |
| 2 | 2 | Belgium | 2020-13 | 0.015666 | 31.465311 |
| 3 | 3 | Belgium | 2020-14 | 0.022307 | 25.737782 |
| 4 | 4 | Belgium | 2020-15 | 0.031189 | 18.270830 |
stander(testing_positivity_subset,'positivity_rate') #create positivity_rate as stander
| index | country | year_week | testing_rate | positivity_rate | |
|---|---|---|---|---|---|
| 0 | 0 | Belgium | 2020-11 | 0.005532 | 0.555167 |
| 1 | 1 | Belgium | 2020-12 | 0.009736 | 0.859094 |
| 2 | 2 | Belgium | 2020-13 | 0.015666 | 1.000000 |
| 3 | 3 | Belgium | 2020-14 | 0.022307 | 0.817786 |
| 4 | 4 | Belgium | 2020-15 | 0.031189 | 0.580235 |
testing_positivity_long = pd.melt(testing_positivity_subset, id_vars=['country', 'year_week'], value_vars=['testing_rate', 'positivity_rate'])#reshape to long
testing_positivity_long['year_week'] = testing_positivity_long['year_week'].str.replace('2020-', '')#change format
testing_positivity_long.head()#view
| country | year_week | variable | value | |
|---|---|---|---|---|
| 0 | Belgium | 11 | testing_rate | 0.005532 |
| 1 | Belgium | 12 | testing_rate | 0.009736 |
| 2 | Belgium | 13 | testing_rate | 0.015666 |
| 3 | Belgium | 14 | testing_rate | 0.022307 |
| 4 | Belgium | 15 | testing_rate | 0.031189 |
As dataset is rather huge there is no chance to make conclusions based on the table format. Scatterplot below demonstrate correlations between testing rate and positivity rate by dates and countries. Plots are separated by facets with different scales to make plot interpretaion easier, as levels are quite different per countries.
fig = plt.gcf()
fig.set_size_inches(12, 8)
g = sns.FacetGrid(testing_positivity_long, col="country", hue="variable", col_wrap=4, sharey=False)
g.map_dataframe(sns.scatterplot, x="year_week", y="value")
g.set_axis_labels("Week of the year", "Statistics")
g.add_legend()
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Testing rate and positivity rate per 100 000 population\n')
for ax in g.axes.flat:
box = ax.get_position()
ax.set_position([box.x0,box.y0,box.width*0.7,box.height])
labels = ax.get_xticklabels() # get x labels
for i,l in enumerate(labels):
if(i%3 != 0): labels[i] = '' # skip 1,2 labels, leave 3 label
ax.set_xticklabels(labels, rotation=90) # set new labels
plt.show()
<ipython-input-9-f3556df481a1>:15: UserWarning: FixedFormatter should only be used together with FixedLocator ax.set_xticklabels(labels, rotation=90) # set new labels
to calculate the correlation for each counter we will define afunction
counters=testing_positivity_subset.groupby(['country']).size().index
def corr_estimate(data_frame,col1,col2):
print('the correlation between ',col1,'and',col2,' can be as following:\n')
print('*************************************************************************\n')
postive=[]
poor=[]
negative=[]
for i in counters:
counter=data_frame[data_frame['country']==i]
corr=counter[col1].corr(counter[col2])#correlation
if (corr >=.6):
postive.append([i,corr])
elif(corr<.2):
negative.append([i,corr])
else:
poor.append([i,corr])
for i in postive:
print('strong postive correlations {0:.4f}'.format(i[1]),'for' ,i[0])
print('******************************************************\n')
for i in negative:
print('strong negative correlations {0:.4f}'.format(i[1]),'for' ,i[0])
print('******************************************************\n')
for i in poor:
print('poor correlations {0:.4f}'.format(i[1]),'for' ,i[0])
corr_estimate(testing_positivity_subset,'testing_rate','positivity_rate')#estimate correlations for each counter between 'testing_rate','positivity_rate'
the correlation between testing_rate and positivity_rate can be as following: ************************************************************************* strong postive correlations 0.8423 for Czechia strong postive correlations 0.6266 for Estonia strong postive correlations 0.8957 for Latvia strong postive correlations 0.6383 for Malta strong postive correlations 0.9171 for Slovenia ****************************************************** strong negative correlations 0.1920 for Belgium strong negative correlations -0.1945 for Denmark strong negative correlations 0.0827 for France strong negative correlations -0.0026 for Ireland strong negative correlations 0.0951 for Sweden ****************************************************** poor correlations 0.4348 for Germany poor correlations 0.5511 for Greece poor correlations 0.2156 for Italy poor correlations 0.2447 for Netherlands poor correlations 0.3638 for Portugal
This question involves data along long period so, it is also reasonable to use visualizations for the answer. Columns needed for this answer have been organized to the separate dataset and reshaped to wide format, like for the previous answer. Head of the created dataset describes that both new cases and deaths are combined to the columns of the type variable-value. Year-week format also have been changed to make axis more readable.
testing_cases_subset = countries_subset[['country', 'year_week', 'cases_weekly', 'deaths_weekly']]#select columns
testing_cases_subset = testing_cases_subset.reset_index()#reset indices
stander(testing_cases_subset,'cases_weekly')# chnage range of cases_weekly to be 0: 1
stander(testing_cases_subset,'deaths_weekly')# chnage range of deaths_weekly to be 0: 1
| index | country | year_week | cases_weekly | deaths_weekly | |
|---|---|---|---|---|---|
| 0 | 0 | Belgium | 2020-11 | 0.004641 | 0.004909 |
| 1 | 1 | Belgium | 2020-12 | 0.012351 | 0.041964 |
| 2 | 2 | Belgium | 2020-13 | 0.022865 | 0.138401 |
| 3 | 3 | Belgium | 2020-14 | 0.026480 | 0.264608 |
| 4 | 4 | Belgium | 2020-15 | 0.026181 | 0.316548 |
testing_cases_long = pd.melt(testing_cases_subset, id_vars=['country', 'year_week'], value_vars=['deaths_weekly','cases_weekly'])#reshape to long
testing_cases_long['year_week'] = testing_cases_long['year_week'].str.replace('2020-', '')#change format
testing_cases_long.head()#view
| country | year_week | variable | value | |
|---|---|---|---|---|
| 0 | Belgium | 11 | deaths_weekly | 0.004909 |
| 1 | Belgium | 12 | deaths_weekly | 0.041964 |
| 2 | Belgium | 13 | deaths_weekly | 0.138401 |
| 3 | Belgium | 14 | deaths_weekly | 0.264608 |
| 4 | Belgium | 15 | deaths_weekly | 0.316548 |
Using facet barplot we can easy compare weekly cases and deaths by bountry and week of the year. Plot describes positive association, and we can estimate it numerically using Pearson correlation.
fig = plt.gcf()
fig.set_size_inches(12, 8)
g = sns.FacetGrid(testing_cases_long, hue="variable", col="country", col_wrap=4, sharey=False)
g.map_dataframe(sns.barplot, x="year_week", y="value")
g.set_axis_labels("Week of the year", "Statistics")
g.add_legend()
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Weekly statistics by new cases and deaths\n')
for ax in g.axes.flat:
box = ax.get_position()
ax.set_position([box.x0,box.y0,box.width*0.6,box.height])
labels = ax.get_xticklabels() # get x labels
for i,l in enumerate(labels):
if(i%3 != 0): labels[i] = '' # skip 1,2 labels, leave 3 label
ax.set_xticklabels(labels, rotation=90) # set new labels
plt.show()
corr_estimate(testing_cases_subset,'cases_weekly','deaths_weekly')#estimate correlations for each counter between 'cases_weekly','deaths_weekly'
the correlation between cases_weekly and deaths_weekly can be as following: ************************************************************************* strong postive correlations 0.8750 for Czechia strong postive correlations 0.6234 for Denmark strong postive correlations 0.8862 for Estonia strong postive correlations 0.7983 for Germany strong postive correlations 0.7400 for Greece strong postive correlations 0.6029 for Italy strong postive correlations 0.9642 for Latvia strong postive correlations 0.7722 for Malta strong postive correlations 0.8858 for Portugal strong postive correlations 0.6459 for Slovenia ****************************************************** ****************************************************** poor correlations 0.2827 for Belgium poor correlations 0.3194 for France poor correlations 0.2299 for Ireland poor correlations 0.3105 for Netherlands poor correlations 0.3313 for Sweden
testing_cases_subset['cases_weekly'].corr(testing_cases_subset['deaths_weekly'])#correlation
0.6252489951120447
over all correlation coefficient describe strong positive assosiation between weekly cases and weekly deaths.
This answer supposes some initial calculations. To answer it we need to calculate total number of COVID-associated deaths along observed period. Based on the calculations below Italy describes the highest rate - more than 75 thousand cases for 2020. The lowest number of deaths observed for Malta.
deaths_countries = countries_subset[['country_code', 'deaths_weekly']]#select countries
deaths_countries['country_code'] = deaths_countries['country_code'].str.lower()
deaths_countries = deaths_countries.groupby(['country_code']).sum()#summarize data
death_dict = deaths_countries.to_dict()#convert to dict
death_dict
<ipython-input-18-ced21a78334e>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy deaths_countries['country_code'] = deaths_countries['country_code'].str.lower()
{'deaths_weekly': {'be': 19750,
'cz': 12070,
'de': 26275,
'dk': 1374,
'ee': 246,
'el': 4957,
'fr': 65018,
'ie': 2255,
'it': 75297,
'lv': 680,
'mt': 220,
'nl': 11595,
'pt': 7118,
'se': 8985,
'si': 2891}}
To compare countries by their geographical positions we can use interactive map.
worldmap = pygal.maps.world.World() #init map
worldmap.title = 'COVID statistics across countries'# set the title of the map
worldmap.add('Deaths', death_dict['deaths_weekly']) # adding the countries
#display(SVG(worldmap.render(disable_xml_declaration=False)))
worldmap.render_in_browser()#render to interactive
#worldmap.render()
file://C:/Users/yalas/AppData/Local/Temp/tmpnzdy9kc2.html
Previous question let us estimate total number of deaths across 2020 because of COVID pandemic. But it is also important to know what is variability in week deaths across described countries. For that we used boxplot of weekly deaths per countries.
countries_subset['fatality'] = countries_subset['deaths_weekly'] / countries_subset['cases_weekly']#creta enew column
fig = px.box(countries_subset, x="country", y="fatality",
labels=dict(country="Country", fatality="Fatality rate")) #create plot
fig.update_layout(title_text="Fatality variations by countries")#add title
fig.show()#view
As each country has different population level we estimated fatality rate - number of deaths per population level. Italy still keeps the leader position by median fatality rate and maximum fatality rate. High median levels of fatality observed for france,Belgium, Greece and Latvia also. The lowest median fatality belongs to Estonia, Denmark, Malta and Portugal have the lowest variability of fatality.
This project describes end-to-end data analysis of COVID-19 related data, including data uploading, filtering, reshaping, transformation and visualization. The main aspects of the project are relevant to the weekly data of 2020 statistics by 15 countries:
Greece Malta,Italy,Portugal,Latvia,Denmark,Czechia ,Sweden ,Belgium,France, Netherlands, Slovenia, Estonia,Germany,Ireland
Along the study there were found that Denmark describes the highest testing rate along 2020 year. There were assosiation between testing rate and positivity rate some countiers is strong postive like Croatia some strong negative like Cyprus and some poor like Italy . like At the same time weekly deaths are positively associated with weekly new cases at the high level. Italy described the highest number of deaths along 2020 with more than 75k defined cases. It also has the highest fatality rate among 15 countries used for the analysis.
All further steps relevant this project are oriented for the 2021 data analysis, including factors of vaccination. As different virus stamps have been defined during last month it is important to include this information to the further investigations of COVID-related data.
How ECDC collects and processes COVID-19 data. https://www.ecdc.europa.eu/en/covid-19/data-collection
Sources - Worldwide data on COVID-19. https://www.ecdc.europa.eu/en/publications-data/sources-worldwide-data-covid-19
Data on hospital and ICU admission rates and current occupancy for COVID-19. https://www.ecdc.europa.eu/en/publications-data/download-data-hospital-and-icu-admission-rates-and-current-occupancy-covid-19
Data on testing for COVID-19 by week and country. https://www.ecdc.europa.eu/en/publications-data/covid-19-testing
Data on the weekly subnational 14-day notification rate of new COVID-19 cases. https://www.ecdc.europa.eu/en/publications-data/weekly-subnational-14-day-notification-rate-covid-19